114 research outputs found

Identification of disease-causing genes using microarray data mining and gene ontology

Author: A Mohammadi
A Zhang
AA Alizadeh
Azadeh Mohammadi
B Duval
BF Souza
C Ambroise
C Ding
C Tago
D Lin
D Singh
E Martinez
FM Couto
I Guyon
I Inza
J Jaeger
JJ Jiang
L Li
L Yu
L Ziaei
Mansoor Salehi
Mohammad H Saraee
N Cristianini
P Pavlidis
P Resnik
PA Mundra
PJ Park
R Genuer
RF Weaver
S Li
TM Huang
TR Golub
TS Furey
U Alon
W Xu
Y Ding
Y Saeys
Y Wang
YL Chin
Z Xie
Z Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

Background: One of the best and most accurate methods for identifying disease-causing genes is monitoring gene expression values in different samples using microarray technology. One of the shortcomings of microarray data is that they provide a small quantity of samples with respect to the number of genes. This problem reduces the classification accuracy of the methods, so gene selection is essential to improve the predictive accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVMRFE) has become one of the leading methods, but its performance can be reduced because of the small sample size, noisy data and the fact that the method does not remove redundant genes. Methods: We propose a novel framework for gene selection which uses the advantageous features of conventional methods and addresses their weaknesses. In fact, we have combined the Fisher method and SVMRFE to utilize the advantages of a filtering method as well as an embedded method. Furthermore, we have added a redundancy reduction stage to address the weakness of the Fisher method and SVMRFE. In addition to gene expression values, the proposed method uses Gene Ontology which is a reliable source of information on genes. The use of Gene Ontology can compensate, in part, for the limitations of microarrays, such as having a small number of samples and erroneous measurement results. Results: The proposed method has been applied to colon, Diffuse Large B-Cell Lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that our method has improved classification performance in terms of accuracy, sensitivity and specificity. In addition, the study of the molecular function of selected genes strengthened the hypothesis that these genes are involved in the process of cancer growth. 
Conclusions: The proposed method addresses the weaknesses of conventional methods by adding a redundancy reduction stage and utilizing Gene Ontology information. It predicts marker genes for colon, DLBCL and prostate cancer with high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for a cure for cancers.
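
The filtering stage mentioned in the abstract above can be illustrated with a minimal sketch. It covers only a Fisher-style filter, not the SVMRFE, redundancy reduction or Gene Ontology stages, and it assumes the common two-class form of the Fisher criterion (squared difference of class means over the sum of class variances), which the abstract does not spell out. The function names, the synthetic expression matrix and the planted gene index are hypothetical.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-gene Fisher score for a two-class problem.

    X: (n_samples, n_genes) expression matrix; y: array of 0/1 class labels.
    Assumed form: (mean_0 - mean_1)^2 / (var_0 + var_1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    class0, class1 = X[y == 0], X[y == 1]
    numerator = (class0.mean(axis=0) - class1.mean(axis=0)) ** 2
    denominator = class0.var(axis=0) + class1.var(axis=0)
    return numerator / (denominator + 1e-12)  # small constant avoids division by zero

def top_genes(X, y, k):
    """Indices of the k genes with the highest Fisher score, best first."""
    return np.argsort(fisher_scores(X, y))[::-1][:k]

# Hypothetical data: 40 samples, 100 genes, with gene 7 made discriminative.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))
y = np.array([0] * 20 + [1] * 20)
X[y == 1, 7] += 4.0

selected = top_genes(X, y, k=10)
print(selected)
```

In a pipeline of the kind the abstract describes, a filter like this would typically shrink the gene set before a more expensive embedded method is run on the survivors.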

University of Salford Institutional Repository

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Random forest methodology for model-based recursive partitioning: the mobForest package for R

Author: A Zelleis
Barry Eggleston
C Strobl
E Bauer
Georgiy Bobashev
L Breiman
Nikhil R Garge
P Buhlmann
R Genuer
RC Team
RF Anton
TG Dietterich
Publication venue: 'Springer Science and Business Media LLC'

Crossref

An experimental study of the intrinsic stability of random forest variable importance measures

Author: A Altmann
A Kalousis
A Statnikov
A Verikas
AC Haury
AL Boulesteix
CH Park
D Ma
DM Reif
DS Cao
EC Fieller
Fan Yang
H Wang
Huazhen Wang
I Guyon
I Kamkar
J Paul
JM Cadenas
KK Nicodemus
L Breiman
L Hamers
L Yu
LI Kuncheva
MB Kursa
ML Calle
O Okun
R Díaz-Uriarte
R Fagin
R Genuer
S Alelyani
S Loscalzo
S Pleus
SS Lee
SY Kim
TK Ho
VY Kulkarni
Y Han
Y Zhang
Z He
Zhiyuan Luo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

BACKGROUND: The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite the extensive attention paid to traditional stability under data perturbations or parameter variations, few studies consider the influence of the intrinsic randomness in generating VIMs, i.e., bagging, randomization and permutation. To address these influences, in this paper we introduce a new concept of intrinsic stability of VIMs, which is defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations and parameter variations. Two widely used VIMs, i.e., Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests we comprehensively investigate how different factors influence the intrinsic stability. RESULTS: The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional and small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (size of sample) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both negative monotonic correlation and negative linear correlation with the intrinsic stability, while OOB accuracy has monotonic correlations with intrinsic stability. This indicates that high-dimensional, small-sample and high complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to parameter settings of random forest, a large number of trees is preferred. No significant correlations can be seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability.
CONCLUSION: First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs not only comes from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability, and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users would be more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high complexity datasets.
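
The idea of self-consistency among feature rankings in repeated runs can be made concrete with a small sketch. It assumes, as one plausible reading rather than the paper's stated formula, that stability is summarised as the mean pairwise Spearman correlation between importance vectors from repeated runs. The importance vectors here are simulated (a fixed signal plus run-to-run noise) as a stand-in for repeated random forest fits with different seeds; all names are hypothetical.

```python
import numpy as np

def ranks(scores):
    """Rank positions (0 = smallest score); ties are broken by original order."""
    order = np.argsort(scores, kind="stable")
    r = np.empty(len(scores), dtype=float)
    r[order] = np.arange(len(scores))
    return r

def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def intrinsic_stability(runs):
    """Mean pairwise Spearman correlation among importance vectors.

    runs: list of 1-D arrays, one importance vector per repeated run on the
    same data with the same parameters (only the random seed differs).
    """
    runs = [np.asarray(r, dtype=float) for r in runs]
    pairs = [(i, j) for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return float(np.mean([spearman(runs[i], runs[j]) for i, j in pairs]))

# Simulated importance vectors: the same underlying signal plus run-specific noise.
rng = np.random.default_rng(0)
signal = np.linspace(0.0, 1.0, 50)
quiet_runs = [signal + rng.normal(scale=0.01, size=50) for _ in range(5)]
noisy_runs = [signal + rng.normal(scale=1.0, size=50) for _ in range(5)]

stability_quiet = intrinsic_stability(quiet_runs)
stability_noisy = intrinsic_stability(noisy_runs)
print(stability_quiet, stability_noisy)
```

With real data, each vector in `runs` would come from refitting the forest on identical data and settings, so any disagreement between rankings reflects only the randomness of the procedure itself.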

Crossref

Springer - Publisher Connector

Royal Holloway - Pure

PubMed Central

Principal variable selection to explain grain yield variation in winter wheat from features extracted from UAV imagery

Author: A Chang
A Comar
A Haghighattalab
A Makino
AC Kyratzis
Arun-Narenthiran Veeranampalayam-Sivakumar
B Gregorutti
BC Bowman
C Leng
CM Andersen
CN Law
D Basak
D Moravec
DK Ray
ER Hunt
F Degenhardt
F Iqbal
F Lu
FH Holman
G Bai
GC McDonald
H Orhan
Hannah Stoll
HD Bondell
I Guyon
J Bendig
J Crain
J Li
Jakob Geipel
Jiating Li
K Girma
KJ Archer
L Breiman
L Li
L Wang
M Bhatta
M Du
M Schirrmann
MA Hassan
Madhav Bhatta
Martin Kanning
MH Kalubarme
MP Labus
Nicholas D. Garst
O Mutanga
OA Montesinos-López
P Benincasa
P Bühlmann
P Rischbeck
P. Stephen Baenziger
R de Vlaming
R Genuer
R Tibshirani
Reka Howard
S Kipp
S Shafian
SC Kefauver
Senlin Guan
T Chu
T Duan
U Grömping
V Belamkar
Vikas Belamkar
Y Gu
Y Liu
Yeyin Shi
Yufeng Ge
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/01/2019

Background: Automated phenotyping technologies are continually advancing the breeding process. However, collecting various secondary traits throughout the growing season and processing massive amounts of data still take great effort and time. Selecting a minimum number of secondary traits that have the maximum predictive power has the potential to reduce phenotyping efforts. The objective of this study was to select principal features extracted from UAV imagery and critical growth stages that contributed the most in explaining winter wheat grain yield. Five dates of multispectral images and seven dates of RGB images were collected by a UAV system during the spring growing season in 2018. Two classes of features (variables), totaling 172 variables, were extracted for each plot from the vegetation index and plant height maps, including pixel statistics and dynamic growth rates. A parametric algorithm, LASSO regression (the least absolute shrinkage and selection operator), and a non-parametric algorithm, random forest, were applied for variable selection. The regression coefficients estimated by LASSO and the permutation importance scores provided by random forest were used to determine the ten most important variables influencing grain yield from each algorithm. Results: Both selection algorithms assigned the highest importance score to the variables related to plant height around the grain filling stage. Some vegetation-index-related variables were also selected by the algorithms, mainly at early to mid growth stages and during senescence. Compared with the yield prediction using all 172 variables derived from measured phenotypes, prediction using the selected variables performed comparably or even better. We also noticed that the prediction accuracy on the adapted NE lines (r = 0.58–0.81) was higher than on the other lines (r = 0.21–0.59) included in this study with different genetic backgrounds.
Conclusions: With the ultra-high resolution plot imagery obtained by UAS-based phenotyping, we are now able to derive more features, such as the variation of plant height or vegetation indices within a plot rather than just an averaged number, that are potentially very useful for breeding purposes. However, too many features or variables can be derived in this way. The promising results from this study suggest that a selected subset of those variables can achieve prediction accuracies for grain yield comparable to those of the full set, while possibly resulting in a better allocation of efforts and resources for phenotypic data collection and processing.
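
The LASSO half of the variable selection described above, ranking variables by the size of their estimated regression coefficients, can be sketched as follows. This is an illustration under stated assumptions, not the study's software: the solver is a plain cyclic coordinate descent written for this example, the penalty value `alpha` is arbitrary, and the data are synthetic, with two planted predictors standing in for informative image features.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t (the LASSO proximal step)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimise (1/2n)*||y - X b||^2 + alpha*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with variable j's current contribution added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, alpha) / col_scale[j]
    return beta

def top_k_by_magnitude(beta, k):
    """Indices of the k coefficients with the largest absolute value."""
    return np.argsort(np.abs(beta))[::-1][:k]

# Hypothetical data: 200 plots, 20 candidate variables, two of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise so coefficients are comparable
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)
y = y - y.mean()

beta = lasso_coordinate_descent(X, y, alpha=0.1)
top2 = top_k_by_magnitude(beta, k=2)
print(top2, beta[top2])
```

Standardising the columns first matters here: without it, the coefficient magnitudes would reflect measurement units as much as importance.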

Crossref

DigitalCommons@University of Nebraska

A Survey of Bayesian Statistical Approaches for Big Data

Author: A Akusok
A Baldominos
A Belle
A Beskos
A Bouchard-Côté
A De Mauro
A Fahad
A Gandomi
A Lee
A Marshall
A O’Driscoll
A Siddiqa
A Vyas
AB Owen
AF Wise
AR Linero
AT Azar
AT Porter
AÇ Pehlivanlı
B Franke
B Liquet
B Liu
B Oancea
C Loebbecke
C Wang
C Yang
CA McGrory
CC Drovandi
CE Rasmussen
Changwon Yoo
CK Emani
D Apiletti
D Oprea
D Talia
DB Dunson
DM Blei
DN Politis
DT Frazier
DV Shah
DW Bates
E Raguseo
ED Schifano
ET Bradlow
F Lindsten
Florian Buettner
Florian Maire
G Bello-Orgaz
G Jifa
GI Allen
GJ Lasinio
GM Allenby
H Cai
H Demirkan
H Hassani
H Kousar
HA Chipman
HH Huang
HJ Watson
I Ben-Gal
J Fan
J Roski
J Zhu
Jake Luo
JE Bibault
JJ Chen
JN Cappella
JS Rumsfeld
K Chalupka
Kath Albury
KL Mengersen
KS Divya
L Breiman
L Liu
L Mählmann
L Wang
L Yu
L Zhang
L Zhou
LG Nongxa
M Hilbert
M Viceconti
MA Suchard
Matias Quiroz
MD Assunção
MD Hoffman
MT Moores
N Moustafa
N. Chopin
NA Lazar
O Sysoev
Oliver Müller
OY Al-Jarrah
P Ducange
P Müller
P Pudlo
PF Brennan
R Bardenet
R Burrows
R Guhaniyogi
R Izbicki
RF Mansour
Richard Branch
Robin Genuer
RW Hoerl
S Atkinson
S Castruccio
S Chaudhuri
S Fosso Wamba
S Guha
S Kaisler
S Li
S Minsker
S Pandey
S Sagiroglu
S Sisson
S Srivastava
S Suthaharan
S White
SF Wamba
Shahriar Akter
Shweta Bansal
Simon I. Hay
SL Scott
SM Schennach
Sudipto Banerjee
T Magdon-Ismail
T Zhang
Tengyao Wang
TH McCormick
TJ McKinley
U Sivarajah
VD Katkar
X Zhang
XF Wang
XG Xia
Xing Ju Lee
Y Tang
Y Webb-Vargas
Y Zhang
Yang Ni
YW Teh
Z Ma
Z Sun
Z Zhang
Ziad Obermeyer
Zoubin Ghahramani
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 28/05/2020

The modern era is characterised as an era of information or Big Data. This has motivated a huge literature on new methods for extracting information and insights from these data. A natural question is how these approaches differ from those that were available prior to the advent of Big Data. We present a review of published studies that present Bayesian statistical approaches specifically for Big Data and discuss the reported and perceived benefits of these approaches. We conclude by addressing the question of whether focusing only on improving computational algorithms and infrastructure will be enough to face the challenges of Big Data.

arXiv.org e-Print Archive

Crossref

Queensland University of Technology ePrints Archive

Machine learning uncovers the most robust self-report predictors of relationship quality across 43 longitudinal couples studies

Author: Alan Reifman
Allison A. Vaughn
Amanda M. Vicary
Amie M. Gordon
Amy Muise
Andrea B. Horn
Anik Debrot
Bahns
Brett J. Peters
Brian G. Ogolsky
C. Rebecca Oldham
Carter
Cathy Surra
Cheryl Harasymchuk
Cheryl L. Carmichael
Claudia C. Brumbaugh
Colleen J. Allison
Courtney L. Gosnell
Dacher Keltner
Darby Saxbe
David C. de Jong
Eli J. Finkel
Emily A. Impett
Eran Bar-Kalifa
Erica B. Slotter
Erin L. Ramsdell
Eshkol Rafaeli
Esther S. Kluwer
Eva C. DeHaas
Finkel
Francesca Righetti
Gal Lazarus
Galena K. Rhoades
Genuer
Geoff MacDonald
Grace Larson
Gurit E. Birnbaum
Hagar Ter Kuile
Haran Sened
Harry T. Reis
James J. Kim
Jami Eller
Jaye L. Derrick
Jeffrey L. Kirchner
Jeffry A. Simpson
Jennifer Clarke
Jeremy P. Jamieson
Jessica A. Maxwell
Jill M. Logan
Jody Davis
Joel
Laura B. Luchies
Laura V. Machia
Lavner
Lindsey M. Rodriguez
Madoka Kumashiro
Maija Reblin
Marie-Joelle Estrada
Mariko L. Visserman
Matthew D. Hammond
Meinrad Perrez
Michael K. Coolsen
Michael R. Maniaci
Michael Reicherts
Miller
Moran Mizrahi
Natalie O. Rosen
Nickola C. Overall
Niehuis
Paul W. Eastwick
Paula R. Pietromonaco
Peggy A. Hannon
Pietromonaco
R. Chris Fraley
Rebecca J. Cobb
Rebecca L. Brock
Reuma Gadassi-Polack
Ron Rogge
Rony Pshedetzky-Shochat
Ruddy Faure
Rusbult
Sally I. Powers
Samantha Joel
Scott M. Stanley
Scott Wolf
Serena Chen
Shelly L. Gable
Shevaun Stocker
Sophie Bergeron
Sylvia Niehuis
Thery Prok
Wang
Wilhelm Hofmann
William S. Rholes
Ximena B. Arriaga
Yuthika U. Girme
Zachary G. Baker
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 01/01/2020

Given the powerful implications of relationship quality for health and well-being, a central mission of relationship science is explaining why some romantic relationships thrive more than others. This large-scale project used machine learning (i.e., Random Forests) to 1) quantify the extent to which relationship quality is predictable and 2) identify which constructs reliably predict relationship quality. Across 43 dyadic longitudinal datasets from 29 laboratories, the top relationship-specific predictors of relationship quality were perceived-partner commitment, appreciation, sexual satisfaction, perceived-partner satisfaction, and conflict. The top individual-difference predictors were life satisfaction, negative affect, depression, attachment avoidance, and attachment anxiety. Overall, relationship-specific variables predicted up to 45% of variance at baseline, and up to 18% of variance at the end of each study. Individual differences also performed well (21% and 12%, respectively). Actor-reported variables (i.e., own relationship-specific and individual-difference variables) predicted two to four times more variance than partner-reported variables (i.e., the partner’s ratings on those variables). Importantly, individual differences and partner reports had no predictive effects beyond actor-reported relationship-specific variables alone. These findings imply that the sum of all individual differences and partner experiences exerts its influence on relationship quality via a person’s own relationship-specific experiences, and effects due to moderation by individual differences and moderation by partner-reports may be quite small. Finally, relationship-quality change (i.e., increases or decreases in relationship quality over the course of a study) was largely unpredictable from any combination of self-report variables. This collective effort should guide future models of relationships.

Goldsmiths Research Online

Crossref

VU Research Portal

Kölner UniversitätsPublikationsServer

Serveur académique lausannois

eScholarship - University of California

ZORA

Utrecht University Repository